Abstract
The Iris dataset, introduced by Ronald Fisher in 1936, contains measurements of 150 flowers across three species: setosa, versicolor, and virginica. This paper presents an interactive exploratory analysis demonstrating how petal and sepal measurements distinguish these species. Our analysis shows that while setosa is easily separable, distinguishing versicolor from virginica requires more sophisticated approaches.
Example scroll made with Quarto for Scroll Press.
Analysis
Code
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the dataset into a labeled DataFrame
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target
df['species_name'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
Feature Relationships
Interactive exploration reveals strong correlations between petal measurements and clear separation of setosa from other species.
Code
fig = px.scatter_matrix(
    df,
    dimensions=iris.feature_names,
    color='species_name',
    labels={col: col.replace(' (cm)', '') for col in iris.feature_names},
    color_discrete_map={'setosa': '#636EFA', 'versicolor': '#EF553B', 'virginica': '#00CC96'},
    height=600, width=800
)
fig.update_traces(diagonal_visible=False)
fig.show()
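The "strong correlations between petal measurements" that the scatter matrix suggests can also be checked numerically. A minimal sketch using pandas' built-in `corr()` (Pearson correlation by default):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pairwise Pearson correlations between the four measurements
corr = df.corr()

# Petal length and petal width are very strongly correlated (r > 0.9)
petal_corr = corr.loc['petal length (cm)', 'petal width (cm)']
print(f"petal length vs. petal width: r = {petal_corr:.3f}")
```

This confirms the visual impression: the two petal measurements carry largely redundant information, which is part of why dimensionality reduction works so well on this dataset.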
Dimensionality Reduction
PCA reveals that the first two principal components explain roughly 96% of the variance, enabling an effective 2D visualization.
Code
# Standardize features before PCA so each measurement contributes equally
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
pca_df['species'] = df['species_name']

fig = px.scatter(
    pca_df, x='PC1', y='PC2', color='species',
    title=f'PCA (Explained Variance: {sum(pca.explained_variance_ratio_):.1%})',
    labels={'PC1': f'PC1 ({pca.explained_variance_ratio_[0]:.1%})',
            'PC2': f'PC2 ({pca.explained_variance_ratio_[1]:.1%})'},
    color_discrete_map={'setosa': '#636EFA', 'versicolor': '#EF553B', 'virginica': '#00CC96'},
    height=500, width=700
)
fig.update_traces(marker=dict(size=10))
fig.show()
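The explained-variance figure can be verified by fitting PCA with all four components and inspecting the cumulative spectrum; a short sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# Keep all four components to see the full variance spectrum
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for i, c in enumerate(cumulative, start=1):
    print(f"PC1..PC{i}: {c:.1%}")
```

The first two components together account for close to 96% of the variance, so the 2D projection discards little of the dataset's structure.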
Classification
Logistic regression achieves 97% accuracy on test data, with errors concentrated in the versicolor–virginica boundary region.
Code
# Train classifier on the 2D PCA projection
X_train, X_test, y_train, y_test = train_test_split(X_pca, iris.target, test_size=0.2, random_state=42)
lr = LogisticRegression(max_iter=200)
lr.fit(X_train, y_train)

# Create a mesh grid covering the projected feature space
h = 0.02
x_min, x_max = X_pca[:, 0].min() - 1, X_pca[:, 0].max() + 1
y_min, y_max = X_pca[:, 1].min() - 1, X_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = lr.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# Plot
# Plot decision regions with the samples overlaid
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Contour(
    x=xx[0], y=yy[:, 0], z=Z,
    colorscale=[[0, '#636EFA'], [0.5, '#EF553B'], [1, '#00CC96']],
    opacity=0.3, showscale=False, hoverinfo='skip'
))
for idx, name in enumerate(['setosa', 'versicolor', 'virginica']):
    mask = iris.target == idx
    fig.add_trace(go.Scatter(
        x=X_pca[mask, 0], y=X_pca[mask, 1],
        mode='markers', name=name,
        marker=dict(size=8, color=['#636EFA', '#EF553B', '#00CC96'][idx])
    ))
fig.update_layout(title='Decision Boundaries', xaxis_title='PC1', yaxis_title='PC2', height=500, width=700)
fig.show()
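The accuracy claim, and the claim that errors sit on the versicolor/virginica boundary, can be checked with a confusion matrix. A minimal self-contained sketch reproducing the same pipeline (standardize, 2-component PCA, logistic regression, same `random_state=42` split):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

iris = load_iris()
X_pca = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(iris.data))
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, iris.target, test_size=0.2, random_state=42)

# Fit the classifier and evaluate on the held-out 30 samples
lr = LogisticRegression(max_iter=200).fit(X_train, y_train)
acc = lr.score(X_test, y_test)
cm = confusion_matrix(y_test, lr.predict(X_test))

print(f"test accuracy: {acc:.1%}")
print(cm)  # rows: true class, columns: predicted class
```

Any off-diagonal counts should appear in the versicolor/virginica rows and columns; the setosa row stays clean, consistent with its linear separability.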
Conclusion
The Iris dataset demonstrates fundamental machine learning concepts: setosa is linearly separable, while versicolor and virginica overlap in feature space. Interactive visualizations reveal these relationships clearly, making this dataset an enduring pedagogical example.